ABSTRACT


In today's information-driven business environment, data is the fuel for growth in both data-driven and non-data-driven
organizations. Channelized analysis and visualization of data have proved to be key factors in improving the business
models and the overall performance of an organization. Exploratory Data Analysis (EDA) is therefore a key process: it
helps organizations make informed business decisions by examining the data and extracting meaningful insights from it,
exposing trends, patterns, and relationships that are not readily apparent.


A review of previous research in the field of EDA identified a few gaps in the process of performing EDA on a dataset.
Firstly, older data visualization techniques that are no longer effective need to be replaced with more efficient and
precise ones. Secondly, most of the time and effort in EDA goes into data cleaning and pre-processing, and only
afterwards do the actual data analysis and insight gathering begin. To improve the efficiency of the process, this
research paper introduces automation into the process of EDA.
INTRODUCTION


In today’s digital world, insights obtained from Exploratory Data Analysis (EDA) are used in strategic business 
decision-making. EDA is a fundamental procedure that makes use of statistical techniques and graphical 
representation to obtain insights from data. EDA not only assists with the identification of hidden patterns and 
correlations among attributes in data but also helps with the formulation and validation of hypotheses from the data. 
Over the last few decades, interactive visualization strategies have become an integral part of data exploration and
analysis techniques.


With a picture being worth a thousand words, academics have proposed several tools and techniques to
visualize complex relationships among data attributes using simple diagrams and charts. While some of these
visual data analysis tools assist with domain-specific data analysis (for example, analysis of genome sequence
data, meteorological data, or results of predictive analysis), others focus on general-purpose exploratory
browsing of tabular data. In either case, since the beginning of visual interactive data analysis, almost all visual
EDA tools have performed a few common analytics tasks. Earlier work, as well as this paper, identifies these basic
data exploration tasks as sort, filter, aggregate, correlate, group, and derive attributes.
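
As a rough illustration of these six tasks, the following minimal pandas sketch applies each of them to a small made-up table. The column names and values are hypothetical and are not taken from the data set used in this paper.

```python
import pandas as pd

# Hypothetical toy table; not the data set used in this paper.
df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10, 20, 30, 40],
    "price": [2.0, 3.0, 4.0, 5.0],
})

sorted_df = df.sort_values("units", ascending=False)   # sort
filtered_df = df[df["units"] > 15]                      # filter
total_units = df["units"].sum()                         # aggregate
units_price_corr = df["units"].corr(df["price"])        # correlate (Pearson)
units_by_region = df.groupby("region")["units"].sum()   # group
df["revenue"] = df["units"] * df["price"]               # derive an attribute
```

Each task is a single call, which is what makes them natural candidates for automation behind a button or drop-down.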


Since performing these repetitive tasks again and again takes a great deal of time and effort,
there needs to be some way to increase the productivity and efficiency of the overall process.
OBJECTIVES 


• To automate the repetitive tasks associated with data analysis, such as data cleaning and data preprocessing.
• To change the traditional way of analyzing data and using it to make meaningful decisions.
• To reduce the processing time of raw data.
• To promote the reusability of data by uploading it to the cloud so that it can be used as and when required without
any further preprocessing.
BACKGROUND RESEARCH 
This paper references the following papers, published between 1984 and 2020, which are summarized as follows:


Cleveland et al. [6] suggested various techniques to improve the perception of graphs generated during EDA,
which helps in more channelized analysis. Buja et al. [3] addressed some of the issues with the visualization of
high-dimensional and large data sets and proposed various approaches to resolve them. Friendly M. et al. [8] discussed
the various ways in which data can be displayed using graphs and how insights can be gathered from them. Billard et al.
[2] emphasized how data can be analyzed and visualized more efficiently so that knowledge and information of practical
use can be gathered from it. Gelman et al. [9] highlighted the Bayesian formulation of Exploratory Data Analysis and
suggested different approaches to it. Johnstone I.M. et al. [13] addressed some of the statistical challenges that occur
while analyzing high-dimensional data and suggested ways to deal with them. Chong Ho Yu [5] proposed a conventional
conceptual framework of EDA based on data mining and resampling, using cluster detection, variable selection, and
pattern selection. C. Chen [4] discussed the importance of the information hidden in data, which can be extracted
through meaningful visualization to gather actionable insights. L. Yu et al. [14] addressed various issues in
time-varying data visualization and proposed methods for the automatic animation of such data. Lei Yang et al. [15]
noted the rapid development of EDA technology and the growing number of EDA tools and software, and introduced the
application, function, design flow, and some important development tools of EDA technology. S.A. Murphy [18] suggested
the use of BI tools for visualization and for creating dashboards to support library decision-making, and proposed
various ways to make this process quick. Idreos S. et al. [11] provided a basic overview of various data exploration
techniques that help in drawing conclusions from raw data. J. Wolfe [12] emphasized the importance of data in data
visualization and how to make data interactive and appealing. X. Li et al. [20] emphasized the importance of large data
visualization and suggested the use of advanced aggregate computation for the analysis of huge data sets. Godfrey P. et
al. [10] addressed some of the issues with performing EDA on large data sets, such as the time required to preprocess
such data, and proposed several innovative solutions. T.J. Brigham [19] introduced the concept of data visualization in
a unique and intriguing way and emphasized story-telling with charts and correlations. R.R. Laher [17] described
‘Thoth’, a software package for data visualization and statistics, discussing its functionalities and how it can be an
ideal choice for EDA. Battle L. et al. [1] proposed dynamic pre-fetching of data tiles to make the process of EDA faster
and as interactive as possible, and emphasized the importance of data management in EDA. El Hindi et al. [7] discussed
VisTrees, a tool for visualizing and characterizing subgroups in a data set, and also discussed fast indexes for
interactive data exploration. Yalcin M.A. et al. [21] introduced the concepts of expressive tabular data analysis along
with methods to make this process rapid and time-efficient. Rahul Reddy Nadikattu [16] elaborated on modern,
time-efficient, and consistent research techniques in the fields of Data Science, Data Analytics, and Data
Visualization. After reviewing these papers, it was found that little research has focused on automating the repetitive
steps of EDA so as to make the overall process time-efficient.


METHODOLOGY 
System Architecture 


[Figure: System architecture. Data Set Upload feeds two branches: Analysis (Shape, Statistical Analysis, Specific
Columns, Value Count, Correlation Plot, Columns/Rows, Relation Between Columns, Entities) and Plotting Graphs (Types of
Plots: Line, Pie, Histogram, Box; Specifying columns). Both lead to Automated Analysis on Cloud.]


MAJOR MODULES AND THEIR FUNCTIONALITIES 


Client Interface (Front end): This is the user interface through which the user interacts with the system and performs
automated data analysis and visualization.


It is designed using Streamlit, an open-source Python library that makes it easy to create and share custom web apps.


Upload Feature: It enables the user to upload a dataset on which to perform EDA. A user can currently upload data in
TXT or CSV format; support for more formats is planned.
Core Functionality Modules: 


Exploratory Data Analysis: In this module, several features of EDA are automated so that the user can trigger them with
a single click of a button, such as plotting the correlation plot using Matplotlib and Seaborn, drawing the pie plot,
and carrying out other statistical analyses for more precise and accurate results.


Data Visualization: In this module, various ways of visualizing data are automated for the user, such as generating
bar, histogram, area, and KDE charts from a single attribute or from multiple attributes at a time. All a user needs to
do is select the type of chart and the attribute(s) from the drop-down list.
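
The drop-down selection described above could map onto the 'kind' argument of pandas' DataFrame.plot, as in the following minimal sketch. The helper name plot_columns, the list of chart kinds, and the sample values are assumptions made for illustration; they are not the system's actual code.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

# Chart kinds accepted by DataFrame.plot(kind=...); "kde" additionally needs SciPy.
CHART_KINDS = ["bar", "hist", "area", "kde", "line", "box"]

def plot_columns(df, kind, columns):
    """Plot one or more numeric columns of df with the chosen chart kind."""
    if kind not in CHART_KINDS:
        raise ValueError(f"unsupported chart kind: {kind}")
    return df[columns].plot(kind=kind)  # returns a matplotlib Axes

# Hypothetical values standing in for an uploaded data set.
data = pd.DataFrame({
    "glucose": [85, 183, 89, 137],
    "bmi": [26.6, 23.3, 28.1, 43.1],
})
ax = plot_columns(data, "bar", ["glucose", "bmi"])
```

Because the chart type is just a string, the same function can serve every entry of the drop-down without a separate code path per chart.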


Database: At the current stage, the user needs to upload a data set every time he/she needs to perform EDA, so the
current progress is lost every time the user closes the system. With the help of a database, a user will be able to
store the data so that his/her progress is not lost. This module will also ensure data integrity and security.
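
The paper does not name a specific database, so the following is only a sketch of how an uploaded data set might be persisted and restored. It assumes SQLite (from the Python standard library) together with pandas; the helper names save_dataset and load_dataset and the table name are hypothetical.

```python
import sqlite3
import pandas as pd

def save_dataset(conn, table, df):
    """Store df as a table, replacing any earlier copy with the same name."""
    df.to_sql(table, conn, if_exists="replace", index=False)

def load_dataset(conn, table):
    """Read a previously stored table back into a DataFrame."""
    # The table name is quoted, not parameterized: pass only trusted names.
    return pd.read_sql_query(f'SELECT * FROM "{table}"', conn)

conn = sqlite3.connect(":memory:")  # a file path instead would persist across sessions
uploaded = pd.DataFrame({"glucose": [85, 183], "outcome": [0, 1]})  # made-up rows
save_dataset(conn, "diabetes", uploaded)
restored = load_dataset(conn, "diabetes")
```

With a file-backed or hosted database in place of the in-memory connection, a returning user could reload the stored table instead of uploading the file again.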


Login/Register Page: This module enables a new user to register on the system and then log in to perform EDA in a more
secure and customized manner.
Experimental Setup 


Pandas is a free Python software library for data analysis and data handling. Pandas provides various
high-performance, easy-to-use data structures and operations for manipulating data in the form of numerical
tables and time series.


NumPy is a free Python software library for numerical computing on data in the form of large, multi-dimensional
arrays and matrices. These multi-dimensional arrays are the main objects in NumPy; their dimensions are called
axes, and the number of axes is called the rank (exposed as the ‘ndim’ attribute).
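
The notions of axes and rank can be made concrete with a short example. The array values below are made up for illustration.

```python
import numpy as np

# A 2-D array: 2 rows (axis 0) by 3 columns (axis 1).
table = np.array([[1, 85, 66],
                  [8, 183, 64]])

rank = table.ndim     # number of axes
shape = table.shape   # length of each axis

column_means = table.mean(axis=0)  # collapse axis 0: one mean per column
row_sums = table.sum(axis=1)       # collapse axis 1: one sum per row
```

Passing 'axis' to a reduction names the axis that is collapsed, which is why axis=0 yields one value per column.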


Scikit-learn is a free software library for machine learning in the Python programming language. It began in 2007
as a Google Summer of Code project by David Cournapeau, and its first public release followed in 2010.
Scikit-learn is built on top of other Python libraries such as NumPy and SciPy.


Matplotlib is a data visualization and 2-D plotting library for Python. It can be used to create line plots, bar
charts, pie charts, histograms, scatterplots, error charts, power spectra, stemplots, and many other kinds of
visualization.


Seaborn is a Python data visualization library that is based on Matplotlib and closely integrated with
NumPy and pandas data structures. Seaborn has various dataset-oriented plotting functions that operate on data
frames and arrays containing whole datasets.


Streamlit is an open-source app framework for Python. It helps create web apps for data science and machine
learning in little time. It works with major Python libraries such as scikit-learn, Keras, PyTorch, NumPy,
pandas, and Matplotlib, and it can also render LaTeX.
IMPLEMENTATION

Implementation of Modules/Algorithms 

Streamlit for Client Interface 

The client interface, i.e. the UI through which the user interacts with the system, is designed using Streamlit in
Python. In other words, the user's first interaction with the system is through the interface built with Streamlit. All
the controls, such as the button for uploading data, the drop-down asking the user his/her preference for data
analysis, and other UI buttons, are designed using Streamlit.


For example, the sidebar for selecting the type of activity in the system is designed using Streamlit's ‘sidebar’
container and ‘selectbox’ function. Similarly, the file upload option was made using the ‘file_uploader’ function of
Streamlit.
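
A minimal sketch of how these Streamlit calls fit together is given below. The function name render_controls and the second menu entry are assumptions; the labels follow the screenshots, and the accepted file types follow the upload feature described earlier. The Streamlit module is passed in as a parameter so that the function can be exercised without a running server.

```python
ACTIVITIES = ["EDA", "Plots"]  # assumed menu entries; only "EDA" appears in the screenshots

def render_controls(st):
    """Draw the sidebar menu and the upload widget.

    `st` is the imported streamlit module. Returns the selected activity
    and the uploaded file object (None until the user uploads a file).
    """
    st.title("Automated System of Exploratory Data Analysis")
    activity = st.sidebar.selectbox("Select Activities", ACTIVITIES)
    uploaded = st.file_uploader("Upload a Dataset", type=["csv", "txt"])
    return activity, uploaded

# In an app script started with `streamlit run app.py`:
#     import streamlit as st
#     activity, uploaded = render_controls(st)
```

Streamlit reruns the script from top to bottom on every interaction, so the returned values always reflect the current state of the widgets.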


[Figures: Screenshots of the Streamlit interface running at localhost:8502, titled "Automated System of Exploratory Data
Analysis": the Home page; the Login Section (logged in as srajan) with the "Select Activities" menu and the "Upload a
Dataset" drag-and-drop control; and the Signup page with the "Create New Account" form.]


This paper has analyzed the working of the system on a sample diabetes data set. Here are some of the crucial sections of
the system with snapshots.


Pandas for Reading the Data. The data that a user uploads through the Streamlit interface is read using Pandas. Using
this library, data can be read in many formats, such as TXT and CSV.
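
A minimal sketch of this reading step follows. The helper name read_uploaded and the sample rows are hypothetical. It relies on pandas' read_csv with sep=None and the Python engine, which asks pandas to detect the delimiter, so that comma-separated and tab-separated uploads can share one code path. The shape, column list, and summary correspond to the "Show Shape", "Show Columns", and "Summary" options of the interface.

```python
import io
import pandas as pd

def read_uploaded(file_obj):
    """Read a delimited text file (CSV or TXT) into a DataFrame.

    sep=None with engine="python" makes pandas sniff the delimiter
    instead of assuming a comma.
    """
    return pd.read_csv(file_obj, sep=None, engine="python")

# Stand-in for an uploaded tab-separated file; the values are made up.
sample = io.StringIO("pregnancies\tglucose\tbmi\n1\t85\t26.6\n8\t183\t23.3\n")
df = read_uploaded(sample)

shape = df.shape             # (rows, columns)
columns = list(df.columns)   # column names
summary = df.describe()      # summary statistics for numeric columns
```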


[Figure: Screenshot of the "Upload a Dataset" page at localhost:8502 after uploading diabetes.txt (upload limit 200 MB
per file), showing a preview of the first rows of the data and the "Show Shape", "Show Columns", and "Summary" options
under the EDA activity.]
SEABORN FOR GENERATING HEAT MAP


The heat map offered as one of the functionalities of the project is generated with the help of the Seaborn library in
Python. Seaborn's ‘heatmap’ function is used to generate the heat map for the given data set by specifying the required
parameters.
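
As a hedged illustration, a correlation heat map can be produced as follows. The data values and the chosen parameters (annot, cmap, vmin, vmax) are assumptions made for the example; the paper does not list the parameters actually used.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical numeric columns standing in for the uploaded data set.
df = pd.DataFrame({
    "glucose": [85, 183, 89, 137, 116],
    "bmi": [26.6, 23.3, 28.1, 43.1, 25.6],
    "age": [31, 32, 21, 33, 30],
})

corr = df.corr()  # pairwise Pearson correlation between the columns
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation heat map")
```

Fixing vmin and vmax to the range of a correlation coefficient keeps the color scale comparable across data sets.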
MATPLOTLIB FOR GENERATING VARIOUS PLOTS


Functions such as ‘matshow’, available in the Matplotlib library, were used in the backend to generate various kinds of
plots for the user. This function was called through Matplotlib's ‘pyplot’ module.
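
A small sketch of a ‘matshow’ call is shown below, under the assumption that it is used to display a matrix such as a correlation table. The matrix values and labels are made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical symmetric matrix, e.g. correlations between three columns.
values = np.array([[1.0, 0.2, -0.4],
                   [0.2, 1.0, 0.6],
                   [-0.4, 0.6, 1.0]])
labels = ["glucose", "bmi", "age"]

fig, ax = plt.subplots()
image = ax.matshow(values, cmap="viridis")  # draw the matrix as a colored grid
fig.colorbar(image)                         # legend mapping colors to values
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
```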
NUMPY


NumPy was used in the project to perform numerical calculations on the data and to use the results to draw various
plots and charts in the data visualization section; without these calculations, the plots would have been nearly
impossible to draw.


A NumPy array provides much more efficient storage and data operations than a built-in Python list, and the advantage
grows as the array grows larger.
DATA VISUALIZATION
Users can visualize the data in various ways using the drop-down menu feature of the system.
THREATS TO VALIDITY


• For now, the system is suitable for small to medium-sized datasets. It can be extended to accept large datasets in
the future.
• There is a restriction on the types of datasets (such as CSV, xlsx, txt) that can be used with the system, but it can
be made compatible with almost all kinds of datasets.
• For now, automation has been applied to a limited number of ways of performing EDA and data visualization, but a
similar approach can be used to introduce automation for more complex ones as well.
CONCLUSIONS


For this research, various research papers were studied and analyzed, and it was found that in the field of EDA the
traditional methods of analyzing and visualizing data have seldom been modified or automated to make the process
efficient. This research has therefore tried to bridge that gap by automating some of the most time-consuming stages of
the EDA process.


For the research, a sample data set was used to carry out various kinds of analyses in order to validate the results
obtained with the proposed way of performing EDA.


In the future, this research can serve as the basis for further development of the proposed system. For now, the system
works for simple data sets, but with more research it can be extended to provide multiple features for
multi-dimensional data sets as well.